Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which operate under low data availability in natural language processing. Among them, zero-shot learning stands out: it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but run into two problems: high execution time and an inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo performs better on long inputs and runs faster, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset. Keywords: Low-Resource NLP, Unlabeled Data, Zero-Shot Learning, Topic Modeling, Transformers.
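The core idea above, clustering first and only then assigning labels in a zero-shot fashion, can be pictured with a minimal sketch. This is not the ZeroBERTo implementation: the toy corpus, the label descriptions, and the TF-IDF/K-Means stand-ins for Transformer embeddings and topic modeling are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of "cluster, then label" zero-shot
# classification: compress the unlabeled corpus into cluster centroids and
# match each centroid against candidate label descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the team won the championship final last night",
    "the striker scored two goals in the match",
    "the government approved the new tax bill",
    "parliament debated the proposed budget reform",
]
label_names = ["sports", "politics"]
label_descriptions = ["sports match goals championship team",
                      "politics government parliament tax budget"]

# 1) Unsupervised step: compress the corpus into a few cluster centroids.
vectorizer = TfidfVectorizer().fit(docs + label_descriptions)
X = vectorizer.transform(docs)
kmeans = KMeans(n_clusters=len(label_names), n_init=10, random_state=0).fit(X)

# 2) Zero-shot step: score each centroid against each label description.
#    A real system would use a Transformer encoder or an NLI model here.
label_vecs = vectorizer.transform(label_descriptions).toarray()
scores = kmeans.cluster_centers_ @ label_vecs.T          # (n_clusters, n_labels)
cluster_to_label = scores.argmax(axis=1)

for doc, cluster in zip(docs, kmeans.labels_):
    print(f"{label_names[cluster_to_label[cluster]]:>8} | {doc}")
```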
The subject under discussion frequently changes in moderated debates with several participants, such as parliamentary sessions, electoral debates, and trials. Partitioning a debate into blocks that share the same subject is essential for its understanding. Often, the moderator is responsible for signaling when a new block begins, so the task of automatically partitioning a moderated debate can focus exclusively on the moderator's behavior. In this paper, we (i) propose a new algorithm, DEBACER, which partitions moderated debates; (ii) carry out a comparative study between a conventional pipeline and a BERTimbau-based one; and (iii) validate it by applying it to the minutes of the Assembly of the Republic of Portugal. Our results show the effectiveness of DEBACER. Keywords: Natural Language Processing, Political Documents, Spoken Text Processing, Speech Splitting, Dialogue Partitioning.
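As a rough illustration of the segmentation task, the sketch below starts a new block whenever the moderator's turn contains a topic-change cue. It is a baseline-style stand-in, not the DEBACER algorithm; the speaker names and cue words are assumptions made up for the example.

```python
# Illustrative sketch only: split a debate transcript into blocks by looking
# at the moderator's interventions, the cue the abstract says the task can
# focus on.
from typing import List, Tuple

CUE_WORDS = {"next", "move on", "following point", "agenda"}

def split_debate(turns: List[Tuple[str, str]], moderator: str) -> List[List[Tuple[str, str]]]:
    """Start a new block whenever the moderator's turn contains a topic-change cue."""
    blocks, current = [], []
    for speaker, utterance in turns:
        is_cue = speaker == moderator and any(c in utterance.lower() for c in CUE_WORDS)
        if is_cue and current:
            blocks.append(current)
            current = []
        current.append((speaker, utterance))
    if current:
        blocks.append(current)
    return blocks

transcript = [
    ("Moderator", "Welcome, the first item on the agenda is the budget."),
    ("Deputy A", "I believe the budget should prioritize health care."),
    ("Deputy B", "I disagree, education needs more funds."),
    ("Moderator", "Thank you. Let us move on to the next item, foreign policy."),
    ("Deputy A", "Our alliances must be strengthened."),
]

for i, block in enumerate(split_debate(transcript, "Moderator"), 1):
    print(f"Block {i}: {[speaker for speaker, _ in block]}")
```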
Wikipedia is an important free source of intelligible knowledge. Despite that, the Brazilian Portuguese Wikipedia still lacks descriptions for many subjects. In an effort to expand the Brazilian Portuguese Wikipedia, we contribute PLSum, a framework for generating wiki-like abstractive summaries from multiple descriptive websites. The framework has an extractive stage followed by an abstractive one. In particular, for the abstractive stage, we fine-tune and compare two recent variations of the Transformer neural network, PTT5 and Longformer. To fine-tune and evaluate the models, we created a dataset with thousands of examples, linking reference websites to Wikipedia. Our results show that it is possible to generate meaningful abstractive summaries from Brazilian Portuguese web content.
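A hedged sketch of the two-stage extractive-then-abstractive structure described above: the extractive stage here is a simple TF-IDF centroid ranking, and the abstractive call is left as a commented placeholder. The model path and the toy sentences are assumptions, not the released PLSum pipeline.

```python
# Two-stage sketch in the spirit of an extract-then-abstract summarizer; this
# is NOT the PLSum implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_stage(sentences, k=3):
    """Keep the k sentences most similar to the TF-IDF centroid of the input."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    centroid = np.asarray(tfidf.mean(axis=0))
    scores = (tfidf @ centroid.T).ravel()
    top = sorted(np.argsort(scores)[::-1][:k])        # keep original order
    return [sentences[i] for i in top]

sentences = [
    "A cidade foi fundada em 1554 por missionários jesuítas.",
    "O clima é subtropical, com verões quentes.",
    "É o principal centro financeiro da América do Sul.",
    "O time local venceu o campeonato estadual em 2021.",
    "Sua população ultrapassa doze milhões de habitantes.",
]
source_text = " ".join(extractive_stage(sentences, k=3))

# Abstractive stage (sketch): feed the condensed text to a fine-tuned seq2seq
# checkpoint; the model path below is a placeholder assumption.
# from transformers import pipeline
# summarizer = pipeline("summarization", model="path/to/fine-tuned-ptt5")
# print(summarizer(source_text, max_length=60)[0]["summary_text"])
print(source_text)
```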
Malware is a major threat to computer systems and imposes many challenges on cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase in malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes in the data distribution (i.e., concept drift) that directly affect ML model detection rates, something most of the literature does not take into account. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 350K apps). We use these datasets to train an Adaptive Random Forest (ARF) classifier as well as a Stochastic Gradient Descent (SGD) classifier. We also order all dataset samples by their VirusTotal submission timestamps and then extract features from their textual attributes using two algorithms (Word2Vec and TF-IDF). We then run experiments comparing both feature extractors, both classifiers, and four drift detectors (DDM, EDDM, ADWIN, and KSWIN) to determine the best approach for real environments. Finally, we compare possible ways of mitigating concept drift and propose a novel data stream pipeline that updates both the classifier and the feature extractor. To do so, we conduct a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to demonstrate how pervasive drift is, (iii) comparing distinct ML approaches to mitigate the problem, and (iv) proposing an ML data stream pipeline that outperforms literature approaches.
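The streaming setup described above can be sketched as follows. This is not the paper's pipeline: a plain sliding-window error threshold stands in for DDM/EDDM/ADWIN/KSWIN, a stateless hashing vectorizer stands in for the Word2Vec/TF-IDF extractors, and the two-sample "stream" is made up.

```python
# Toy sketch of a drift-aware malware classification stream (assumptions
# throughout): predict, record the error, update the model incrementally, and
# flag drift when the recent error rate exceeds a threshold.
from collections import deque
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**12)    # stateless, stream-friendly
clf = SGDClassifier()
recent_errors = deque(maxlen=200)
CLASSES = [0, 1]                                     # 0 = goodware, 1 = malware
DRIFT_THRESHOLD = 0.35

def process(sample_text, label, warm=False):
    """Predict, record the error, update the model, and report drift."""
    X = vectorizer.transform([sample_text])
    if not warm:                                     # skip prediction before the first fit
        recent_errors.append(int(clf.predict(X)[0] != label))
    clf.partial_fit(X, [label], classes=CLASSES)
    return (len(recent_errors) == recent_errors.maxlen
            and sum(recent_errors) / len(recent_errors) > DRIFT_THRESHOLD)

# Usage: iterate samples in VirusTotal-timestamp order; on drift, reset the
# window (a fuller pipeline would also refit the feature extractor).
stream = [("requests INTERNET permission and reads contacts", 1, True),
          ("simple calculator app with no permissions", 0, False)]
for text, y, first in stream:
    if process(text, y, warm=first):
        recent_errors.clear()
```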
Stress has a great effect on people's lives that cannot be understated. While it can be good, since it helps humans adapt to new and different situations, it can also be harmful when not dealt with properly, leading to chronic stress. The objective of this paper is to develop a stress monitoring solution that can be used in real life while tackling this challenge in a positive way. The SMILE dataset was provided to team Anxolotl, and all that was needed was to develop a robust model. We developed a supervised learning model for classification in Python, reaching a final accuracy of 64.1% and an F1-score of 54.96%. The resulting solution stood up to the robustness test, presenting low variation between runs, which was a major point for its possible integration into the Anxolotl app in the future.
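For context, a minimal sketch of the kind of supervised classification and evaluation loop the abstract describes, reporting accuracy and F1 together with their spread across folds as a robustness proxy; the data below is synthetic, not the SMILE dataset, and the model choice is an assumption.

```python
# Minimal supervised classification sketch with accuracy/F1 reporting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for extracted physiological/behavioral features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])

# Mean and spread across folds: the "low variation between runs" check.
print(f"accuracy: {scores['test_accuracy'].mean():.3f} ± {scores['test_accuracy'].std():.3f}")
print(f"f1-macro: {scores['test_f1_macro'].mean():.3f} ± {scores['test_f1_macro'].std():.3f}")
```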
Recently, extensive studies on photonic reinforcement learning to accelerate the process of calculation by exploiting the physical nature of light have been conducted. Previous studies utilized quantum interference of photons to achieve collective decision-making without choice conflicts when solving the competitive multi-armed bandit problem, a fundamental example of reinforcement learning. However, the bandit problem deals with a static environment where the agent's action does not influence the reward probabilities. This study aims to extend the conventional approach to a more general multi-agent reinforcement learning targeting the grid world problem. Unlike the conventional approach, the proposed scheme deals with a dynamic environment where the reward changes because of agents' actions. A successful photonic reinforcement learning scheme requires both a photonic system that contributes to the quality of learning and a suitable algorithm. This study proposes a novel learning algorithm, discontinuous bandit Q-learning, in view of a potential photonic implementation. Here, state-action pairs in the environment are regarded as slot machines in the context of the bandit problem and an updated amount of Q-value is regarded as the reward of the bandit problem. We perform numerical simulations to validate the effectiveness of the bandit algorithm. In addition, we propose a multi-agent architecture in which agents are indirectly connected through quantum interference of light and quantum principles ensure the conflict-free property of state-action pair selections among agents. We demonstrate that multi-agent reinforcement learning can be accelerated owing to conflict avoidance among multiple agents.
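A classical, non-photonic toy simulation of the stated idea: each state-action pair of a small grid world is treated as a bandit arm, arm selection drives exploration, and Q-learning supplies the value updates. The grid size, rewards, and hyperparameters are illustrative assumptions rather than the paper's setup.

```python
# Toy grid-world Q-learning with bandit-style action ("arm") selection.
import random

SIZE, GOAL = 4, (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}

def choose(state):
    """Epsilon-greedy arm selection over the state's action set, a classical
    stand-in for the conflict-free photonic selection described in the paper."""
    if random.random() < EPS:
        return random.choice(list(ACTIONS))
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

def step(state, action):
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), SIZE - 1), min(max(state[1] + dc, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)

for _ in range(2000):                                   # episodes
    state = (0, 0)
    for _ in range(100):                                # step budget per episode
        action = choose(state)
        nxt, reward = step(state, action)
        target = reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])  # update size plays the bandit "reward" role
        state = nxt
        if state == GOAL:
            break

print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))       # best first move learned
```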
Code generation from text requires understanding the user's intent from a natural language description (NLD) and generating an executable program code snippet that satisfies this intent. While recent pretrained language models (PLMs) demonstrate remarkable performance for this task, these models fail when the given NLD is ambiguous due to a lack of sufficient specifications for generating a high-quality code snippet. In this work, we introduce a novel and more realistic setup for this task. We hypothesize that ambiguities in the specifications of an NLD can be resolved by asking clarification questions (CQs). Therefore, we collect and introduce a new dataset named CodeClarQA containing NLD-Code pairs with created CQAs. We evaluate the performance of PLMs for code generation on our dataset. The empirical results support our hypothesis that clarifications result in more precise generated code, as shown by an improvement of 17.52 in BLEU, 12.72 in CodeBLEU, and 7.7% in exact match. Alongside this, our task and dataset introduce new challenges for the community, including when and what CQs should be asked.
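A small sketch of the evaluation side mentioned above (corpus BLEU and exact match over generated snippets); CodeBLEU is omitted here, the code pairs are made up, and sacrebleu is simply one convenient BLEU implementation, not necessarily the one used in the paper.

```python
# Score generated code against references with BLEU and exact match.
import sacrebleu

references = ["def add(a, b):\n    return a + b"]
hypotheses = ["def add(a, b):\n    return a + b"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
exact = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references)) / len(references)
print(f"BLEU: {bleu:.2f}  exact match: {exact:.2%}")
```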
Neural machine translation (NMT) has become the de-facto standard in real-world machine translation applications. However, NMT models can unpredictably produce severely pathological translations, known as hallucinations, that seriously undermine user trust. It thus becomes crucial to implement effective preventive strategies to guarantee their proper functioning. In this paper, we address the problem of hallucination detection in NMT by following a simple intuition: as hallucinations are detached from the source content, they exhibit encoder-decoder attention patterns that are statistically different from those of good-quality translations. We frame this problem with an optimal transport formulation and propose a fully unsupervised, plug-in detector that can be used with any attention-based NMT model. Experimental results show that our detector not only outperforms all previous model-based detectors, but is also competitive with detectors that employ large models trained on millions of samples.
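The intuition can be sketched as follows, though this is not the paper's detector: the attention mass a translation places over source tokens is compared against a uniform reference with a 1-D Wasserstein distance, and translations whose distance exceeds a threshold are flagged. The attention matrices and the threshold are made-up illustrative values.

```python
# Compare source-side attention mass to a uniform reference and flag outliers.
import numpy as np
from scipy.stats import wasserstein_distance

def attention_anomaly(attn, threshold=0.15):
    """attn: (target_len, source_len) attention weights, rows sum to 1."""
    source_mass = attn.mean(axis=0)                     # total mass per source token
    source_mass = source_mass / source_mass.sum()
    positions = np.arange(len(source_mass))
    uniform = np.full_like(source_mass, 1.0 / len(source_mass))
    dist = wasserstein_distance(positions, positions, source_mass, uniform)
    return dist, dist > threshold

good = np.full((5, 5), 0.2)                             # evenly spread attention
bad = np.zeros((5, 5)); bad[:, 0] = 1.0                 # everything attends to token 0

for name, attn in [("fluent", good), ("suspect", bad)]:
    d, flagged = attention_anomaly(attn)
    print(f"{name}: distance={d:.3f} hallucination={flagged}")
```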
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Learning-based image compression has improved to a level where it can outperform traditional image codecs such as HEVC and VVC in terms of coding performance. In addition to good compression performance, device interoperability is essential for a compression codec to be deployed, i.e., encoding and decoding on different CPUs or GPUs should be error-free and incur negligible performance reduction. In this paper, we present a method to solve the device interoperability problem of a state-of-the-art image compression network. We apply quantization to the entropy networks, which output the entropy parameters. We suggest a simple method that ensures cross-platform encoding and decoding and can be implemented quickly, with a minor performance deviation of 0.3% BD-rate from the floating-point model results.
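One way to picture the interoperability fix is to snap the entropy parameters to a shared fixed-point grid so that encoder and decoder on different hardware derive bit-identical values; the scale and example values below are assumptions, not the paper's exact scheme.

```python
# Illustrative fixed-point quantization of entropy parameters for
# deterministic cross-platform coding (a sketch, not the paper's method).
import numpy as np

SCALE = 1 << 12                                   # assumed 12-bit fractional precision

def to_fixed_point(x):
    """Quantize float entropy parameters to integers shared by both platforms."""
    return np.round(np.asarray(x, dtype=np.float64) * SCALE).astype(np.int32)

def from_fixed_point(q):
    return q.astype(np.float64) / SCALE

# Entropy parameters (e.g., Gaussian scales) predicted by the hyper-network on
# two devices may differ in the last float bits; quantization removes the gap.
params_gpu = np.array([0.1234567, 2.3456789, 0.0012345])
params_cpu = params_gpu + 1e-7                    # tiny cross-platform drift
assert np.array_equal(to_fixed_point(params_gpu), to_fixed_point(params_cpu))
print(from_fixed_point(to_fixed_point(params_gpu)))
```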